Reduce GOP decoder stream synchronizations by KazukiKomon · Pull Request #33 · NVIDIA/ACCV-Lab

KazukiKomon · 2026-06-11T03:53:47Z

Summary

Enable async/event-backed NvDecoder frame copies for GOP decoding.
Move raw YUV output copy onto the decoder stream.
Synchronize once at DecProc completion instead of synchronizing inside each displayed frame copy.

Verification

git diff --check

Discussion

This draft mirrors the internal MR diff for discussion. One point to review is that enabling bEnableAsyncAllocations activates the existing async allocation path in NvDecoder.

xupinjie · 2026-06-11T07:24:51Z

/build

xupinjie · 2026-06-11T07:56:17Z

Why change this param ?

This parameter enables NvDecoder async allocation/event mode. Without it, NvDecoder::HandlePictureDisplay synchronizes the decoder stream for each displayed frame; enabling it lets this path queue the displayed-frame copies and synchronize the decoder stream after DecProc has queued the output copies. I added a code comment at the constructor site to make that intent explicit.

xupinjie · 2026-06-11T08:05:40Z

-    // Copy the decode frames from device
-    CUDA_DRVAPI_CALL(cuMemcpyDtoD((CUdeviceptr)pFrame_buffer, (CUdeviceptr)pFrame, decoder->GetFrameSize()));
+    // Keep the decode-buffer copy on the decoder stream. DecProc performs one
+    // terminal stream sync before returning the Python-visible frame.


"one terminal stream sync before returning the Python-visible frame" may be a misleading. There are some other sync when decoder cuda api call.

Agreed. I changed the comment to avoid implying this is the only CUDA sync in the decoder path. It now only states that the raw output copy is queued on the decoder stream and that DecProc synchronizes that stream before returning the Python-visible frame.

KazukiKomon · 2026-06-11T08:12:52Z

/build

xupinjie · 2026-06-11T08:58:12Z

/build

xupinjie · 2026-06-11T09:26:05Z

PR #33 — `bEnableAsyncAllocations=true`: correct, but ~16% slower

Tested on H200 NVL, CUDA 12.9, against PR head g431e59076. Same tree, only the flag flipped.

Verdict: revert the flag to false. It's safe but gives no speed-up — it's a measurable
regression. The rest of the PR (cuMemcpyDtoDAsync + single terminal sync in DecProc) is fine.

Performance — ~16% regression

Per-Decode() for a 36-frame batch, 80 iters × 3 rounds:

Build	min (ms)	median (ms)
`async=false` (per-frame sync)	40.8	43.5
`async=true` (PR)	45.4	50.7

Distributions don't overlap (true min > false median).

Why (nsys, CUDA API call counts)

API	bEnableAsyncAllocations =`true`	bEnableAsyncAllocations =`false`
`cuStreamSynchronize`	1.4k	64k
`cuEventRecord`	126k	63k
`cuMemcpy2DAsync` / `cuLaunchKernel`	identical	identical
`cuMemAllocAsync`	warm-up only	—

The flag just swaps ~63k per-frame cuStreamSynchronize for ~63k per-frame cuEventRecord — no
GPU work removed. The allocation change (its namesake) is irrelevant: buffers are pooled, so zero
allocs in steady state. Dropping the sync lets the host outrun a 2-surface pool, and the cost
reappears as blocking inside the enqueue calls.

KazukiKomon · 2026-06-11T09:45:42Z

/build

KazukiKomon · 2026-06-11T09:46:27Z

Updated in 91d4ed2: reverted bEnableAsyncAllocations back to false based on the benchmark result. The PR now only keeps the raw-path cuMemcpyDtoDAsync change plus the terminal decoder-stream sync in DecProc.

Signed-off-by: hongyizhang <805701948@qq.com>

xupinjie · 2026-06-12T04:11:39Z

Performance Analysis: `cuMemcpyDtoD` → `cuMemcpyDtoDAsync`

Problem

GetYUVFromFrame previously used cuMemcpyDtoD, which targets the CUDA
legacy null stream. The null stream has implicit bidirectional synchronization
with any blocking CUDA stream: each copy must wait for all pending work on
blocking streams to drain, and vice versa. This causes per-frame decoder
latency to be gated by external CUDA workloads sharing the same context.

Additionally, the old DecProc had no stream sync after the last frame's
copy, meaning Decode() could return before the final frame's data was fully
written — a silent race condition.

The fix replaces cuMemcpyDtoD with cuMemcpyDtoDAsync(decoder->GetStream())
so copies are scoped to the decoder stream only, and adds a single
cuStreamSynchronize at the end of DecProc to guarantee all frames are
ready before returning.

Setup

GPU: NVIDIA H200 NVL
Workload: batch of 8 video files, 1 YUV frame each per Decode() call,
40 iterations, median reported
Concurrent load: a background thread running continuous 2048² fp16
matmuls, tested with two stream types

Results

Concurrent load	Old (`cuMemcpyDtoD`)	New (`cuMemcpyDtoDAsync`)
None	5.3 ms	5.5 ms
Non-blocking stream (`torch.cuda.Stream()`)	5.4 ms	5.5 ms
Blocking stream (`cudaStreamCreate`)	62.6 ms (+1083%)	5.4 ms (+0%)

Conclusion

Typical PyTorch training (streams are cudaStreamNonBlocking by
default): no performance difference — the null stream global barrier does
not apply to non-blocking streams.
Blocking stream workloads (legacy NCCL, cuBLAS global stream, custom
CUDA extensions): old code inflates decoder latency ~10×; new code is
unaffected.
Correctness: the new cuStreamSynchronize eliminates a race condition
where the last frame's buffer could be read before its copy completed.

xupinjie requested changes Jun 11, 2026

View reviewed changes

KazukiKomon marked this pull request as ready for review June 11, 2026 08:09

KazukiKomon force-pushed the zhy/reduce-gop-decoder-sync branch 2 times, most recently from 0acdde1 to 431e590 Compare June 11, 2026 08:12

KazukiKomon force-pushed the zhy/reduce-gop-decoder-sync branch from 431e590 to 91d4ed2 Compare June 11, 2026 09:44

Reduce GOP decoder stream synchronizations

96dfdcc

Signed-off-by: hongyizhang <805701948@qq.com>

KazukiKomon force-pushed the zhy/reduce-gop-decoder-sync branch from 91d4ed2 to 96dfdcc Compare June 11, 2026 10:07

xupinjie approved these changes Jun 12, 2026

View reviewed changes

xupinjie merged commit edae948 into NVIDIA:main Jun 12, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Reduce GOP decoder stream synchronizations#33

Reduce GOP decoder stream synchronizations#33
xupinjie merged 1 commit into
NVIDIA:mainfrom
KazukiKomon:zhy/reduce-gop-decoder-sync

KazukiKomon commented Jun 11, 2026

Uh oh!

xupinjie commented Jun 11, 2026

Uh oh!

xupinjie Jun 11, 2026

Uh oh!

KazukiKomon Jun 11, 2026

Uh oh!

xupinjie Jun 11, 2026

Uh oh!

KazukiKomon Jun 11, 2026

Uh oh!

KazukiKomon commented Jun 11, 2026

Uh oh!

xupinjie commented Jun 11, 2026

Uh oh!

xupinjie commented Jun 11, 2026

Uh oh!

KazukiKomon commented Jun 11, 2026

Uh oh!

KazukiKomon commented Jun 11, 2026

Uh oh!

xupinjie commented Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

KazukiKomon commented Jun 11, 2026

Summary

Verification

Discussion

Uh oh!

xupinjie commented Jun 11, 2026

Uh oh!

xupinjie Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

KazukiKomon Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

xupinjie Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

KazukiKomon Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

KazukiKomon commented Jun 11, 2026

Uh oh!

xupinjie commented Jun 11, 2026

Uh oh!

xupinjie commented Jun 11, 2026

PR #33 — bEnableAsyncAllocations=true: correct, but ~16% slower

Performance — ~16% regression

Why (nsys, CUDA API call counts)

Uh oh!

KazukiKomon commented Jun 11, 2026

Uh oh!

KazukiKomon commented Jun 11, 2026

Uh oh!

xupinjie commented Jun 12, 2026

Performance Analysis: cuMemcpyDtoD → cuMemcpyDtoDAsync

Problem

Setup

Results

Conclusion

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

PR #33 — `bEnableAsyncAllocations=true`: correct, but ~16% slower

Performance Analysis: `cuMemcpyDtoD` → `cuMemcpyDtoDAsync`